Due to its usefulness in data enrichment for data analysis tasks, joinable table discovery has become an important operation in data lake management. Existing approaches target equi-joins, the most common way of combining tables to create a unified view, or semantic joins, which tolerate misspellings and different formats to deliver more join results. They are either exact solutions whose running time is linear in the sizes of the query column and the target table repository, or approximate solutions lacking precision. In this paper, we propose Deepjoin, a deep learning model for accurate and efficient joinable table discovery. Our solution is an embedding-based retrieval method, which employs a pre-trained language model (PLM) and is designed as one framework serving both equi- and semantic joins. We propose a set of contextualization options to transform column contents into a text sequence. The PLM reads the sequence and is fine-tuned to embed columns into vectors such that two columns are expected to be joinable if they are close to each other in the vector space. Since the output of the PLM is fixed in length, the subsequent search procedure becomes independent of the column size. With a state-of-the-art approximate nearest neighbor search algorithm, the search time is logarithmic in the repository size. To train the model, we devise techniques for preparing training data as well as for data augmentation. The experiments on real datasets demonstrate that by training on a small subset of a corpus, Deepjoin generalizes to large datasets and its precision consistently outperforms that of other approximate solutions. Deepjoin is even more accurate than an exact solution to semantic joins when evaluated with labels from experts. Moreover, when equipped with a GPU, Deepjoin is up to two orders of magnitude faster than existing solutions.
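The retrieval pipeline summarized above — contextualize a column as a text sequence, embed it with the PLM, and search the repository with a nearest-neighbor index — can be sketched roughly as follows. This is a minimal illustration assuming an off-the-shelf sentence_transformers encoder and a flat FAISS index; Deepjoin fine-tunes its own PLM and relies on an HNSW-style approximate index for logarithmic search, so the serialization template, model name, and index settings below are stand-ins rather than the paper's configuration.

```python
# Minimal sketch of embedding-based joinable-column retrieval.
# The serialization template, encoder, and index are illustrative assumptions.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

def column_to_text(col_name: str, values: list[str], max_values: int = 50) -> str:
    """Contextualize a column as one text sequence (one possible option)."""
    return f"{col_name}: " + ", ".join(values[:max_values])

# Off-the-shelf encoder standing in for the fine-tuned PLM.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy repository of target columns.
repo = {
    "country":  ["USA", "Canada", "Mexico", "Brazil"],
    "city":     ["Paris", "Berlin", "Tokyo", "Osaka"],
    "currency": ["USD", "CAD", "MXN", "BRL"],
}
names = list(repo)
texts = [column_to_text(n, v) for n, v in repo.items()]
emb = model.encode(texts, normalize_embeddings=True)

# Cosine similarity via inner product on normalized vectors; a production
# setup would swap in an HNSW index for logarithmic search time.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype="float32"))

# Query column: search cost depends on the embedding size, not the column size.
query = column_to_text("nation", ["United States", "Canada", "Brasil"])
q = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q, dtype="float32"), k=2)
for s, i in zip(scores[0], ids[0]):
    print(f"{names[i]}  score={s:.3f}")
```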
Distribution shifts, which often occur in the real world, degrade the accuracy of deep learning systems, and thus improving robustness is essential for practical applications. To improve robustness, we study an image enhancement method that generates recognition-friendly images without retraining the recognition model. We propose a novel image enhancement method, AugNet, which is based on differentiable data augmentation techniques and generates a blended image from many augmented images to improve recognition accuracy under distribution shifts. In addition to standard data augmentations, AugNet can also incorporate deep neural network-based image transformations, which further improve robustness. Because AugNet is composed of differentiable functions, it can be trained directly with the classification loss of the recognition model. AugNet is evaluated on widely used image recognition datasets using various classification models, including Vision Transformer and MLP-Mixer. AugNet improves robustness with almost no reduction in classification accuracy for clean images, outperforming existing methods. Furthermore, we show that interpreting distribution shifts with AugNet and retraining based on that interpretation can greatly improve robustness.
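A minimal sketch of the underlying idea — blend several differentiably augmented views of an image with learnable weights and train only those enhancement parameters through the frozen classifier's loss — is given below. The module name, the particular augmentations, and the hyperparameters are assumptions for illustration, not AugNet's actual architecture.

```python
# Rough sketch: blend differentiable augmentations of an image with learnable
# weights, training only the blending/augmentation parameters through the
# frozen classifier's loss. Names and transforms are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class BlendAugment(nn.Module):
    """Produces a recognition-friendly image as a weighted blend of augmented views."""
    def __init__(self, n_views: int = 3):
        super().__init__()
        self.blend_logits = nn.Parameter(torch.zeros(n_views))
        self.gamma = nn.Parameter(torch.ones(1))  # differentiable brightness-like knob

    def views(self, x: torch.Tensor) -> list[torch.Tensor]:
        # All transforms are differentiable w.r.t. their parameters and the input.
        blurred = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        adjusted = torch.clamp(x * self.gamma, 0.0, 1.0)
        return [x, blurred, adjusted]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.blend_logits, dim=0)
        return sum(wi * v for wi, v in zip(w, self.views(x)))

# Frozen recognition model: only the enhancement module is trained.
classifier = models.resnet18(weights=None).eval()
for p in classifier.parameters():
    p.requires_grad_(False)

augnet = BlendAugment()
opt = torch.optim.Adam(augnet.parameters(), lr=1e-3)

images = torch.rand(8, 3, 224, 224)   # stand-in for a shifted-distribution batch
labels = torch.randint(0, 1000, (8,))

loss = F.cross_entropy(classifier(augnet(images)), labels)
opt.zero_grad()
loss.backward()
opt.step()
```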
Autonomous driving has made great progress and is being introduced into practical use step by step. Meanwhile, the concept of personal mobility is also gaining popularity, and autonomous driving specialized for each individual driver is a new step for it. However, it is difficult to collect from an individual driver of a personal mobility vehicle the large driving dataset that learning-based autonomous driving fundamentally requires. Moreover, when the driver is not familiar with operating the personal mobility vehicle, the dataset will contain non-optimal data. This study therefore focuses on an autonomous driving method for personal mobility that works with such a small and noisy, so-called personal, dataset. Specifically, we introduce a new loss function based on Tsallis statistics, which weights the gradients according to the original loss function and allows us to exclude noisy data in the optimization phase. In addition, we improve a visualization technique to verify whether the driver and the controller attend to the same regions of interest. The experimental results show that conventional autonomous driving fails to drive properly due to the erroneous operations in the personal dataset, and its regions of interest differ from the driver's. In contrast, the proposed method learns robustly against such errors and drives automatically while attending to regions similar to the driver's. A supplementary video has also been uploaded to YouTube: https://youtu.be/keq8-boxyqa
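To make the gradient-weighting idea concrete, the following is a hedged sketch of one plausible Tsallis-style formulation: each sample's gradient is scaled by the q-exponential of its loss, which drops to zero once the loss exceeds 1/(1-q), so clearly erroneous samples are effectively excluded during optimization. The exact loss used in the paper may differ; the function name and the choice of q here are purely illustrative.

```python
# Illustrative sketch of a Tsallis-style per-sample weighting: each sample's
# gradient is scaled by the q-exponential of its (negative) loss, so samples
# whose loss exceeds 1/(1-q) receive zero weight. The paper's exact
# formulation may differ; this only conveys the general mechanism.
import torch
import torch.nn.functional as F

def tsallis_weighted_loss(logits: torch.Tensor,
                          targets: torch.Tensor,
                          q: float = 0.7) -> torch.Tensor:
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    # q-exponential weight, detached so it only rescales the gradients.
    weight = torch.clamp(1.0 - (1.0 - q) * per_sample, min=0.0) ** (1.0 / (1.0 - q))
    return (weight.detach() * per_sample).mean()

# Example: the noisy (high-loss) sample contributes little or nothing.
logits = torch.tensor([[4.0, 0.0], [0.0, 4.0]], requires_grad=True)
targets = torch.tensor([0, 0])   # second label disagrees with the logits -> large loss
loss = tsallis_weighted_loss(logits, targets)
loss.backward()
print(loss.item(), logits.grad)
```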